{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 19 - k-nearest neighbors\n",
    "\n",
    "The *k-nearest neighbors* algorithm predicts based on the values of the k closest training data.  For example, a 3-nearest neighbor algorithm will find the 3 closest data points (using the Euclidean distance) in the training data and use them to make a prediction.\n",
    "\n",
    "If we are classifying (trying to predict qualitative value), the prediction is the class that appears the most in the k neighbors.\n",
    "\n",
    "If we are performing regression (trying to predict a quantitative value), the prediction is the mean of the y values of the k neighbors.\n",
    "\n",
    "## Classifier\n",
    "\n",
    "We will return to the city services survey data from Lab 12 (Decision tree classifiers).  Recall that this data is collected by the city of [Somerville, MA](https://en.wikipedia.org/wiki/Somerville,_Massachusetts) asking residents about their happiness, as well as ratings of city services. \n",
    "\n",
    "The link to download the data is [https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv)\n",
    "\n",
    "The data columns are:\n",
    "\n",
    "- D = decision attribute (D) with values 0 (unhappy) and 1 (happy) \n",
    "- X1 = the availability of information about the city services \n",
    "- X2 = the cost of housing \n",
    "- X3 = the overall quality of public schools \n",
    "- X4 = your trust in the local police \n",
    "- X5 = the maintenance of streets and sidewalks \n",
    "- X6 = the availability of social community events \n",
    "\n",
    "Attributes X1 to X6 have values 1 to 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    " \n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "    \n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in Lab 12, we will read the data into the dataframe `city`, giving the columns more descriptive names in the process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_column_names = [\"happy\",\"city_info\",\"housing_cost\", \"school_quality\", \\\n",
    "                    \"trust_police\", \"streets_sidewalks\", \"community_events\"]\n",
    "city = pd.read_csv(\"../data/SomervilleHappinessSurvey2015.csv\", \\\n",
    "                    encoding = \"utf-16le\",names = new_column_names, \\\n",
    "                    header = 0)\n",
    "city.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a variable `X` to contain all columns except `happy`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>X = city.iloc[:,1:7]</code>\n",
    "</details>\n",
    "\n",
    "Define a variable y to be the `happy` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>y =city[\"happy\"]</code>\n",
    "</details>\n",
    "\n",
    "Split your X and y data into training and testing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)</code>\n",
    "</details>\n",
    "\n",
    "The following code creates a 3-nearest neighbor classifier (k = 3), fits the training data to it, and makes predictions for the test data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "k3nn = KNeighborsClassifier(n_neighbors = 3)\n",
    "k3nn.fit(X_train, y_train)\n",
    "y_pred = k3nn.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute a confusion matrix for the true values and predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>confusion_matrix(y_test, y_pred, labels = [1,0])</code>\n",
    "</details>\n",
    "\n",
    "Compute the sensitivity, specificity, precision, and accuracy from the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>tn, fn, fp, tp = confusion_matrix(y_test, y_pred, labels = [1,0]).ravel()\n",
    "\n",
    "sensitivity = tp/(tp + fn)\n",
    "specificity = tn/(tn + fp)\n",
    "precision = tp/(tp + fp)\n",
    "accuracy = (tp + tn)/(tp + tn + fp + fn)\n",
    "\n",
    "print(\"Sensitivity:\",sensitivity)\n",
    "print(\"Specificity:\",specificity)\n",
    "print(\"Precision:\", precision)\n",
    "print(\"Accuracy:\",accuracy)</code>\n",
    "</details>\n",
    "\n",
    "How does changing k, the number of neighbors used to make the prediction, affect the performance of this classifier?\n",
    "\n",
    "The results from the decision tree in Lab 12 were: \n",
    "\n",
    "Sensitivity: 0.5584415584415584\n",
    "\n",
    "Specificity: 0.8181818181818182\n",
    "\n",
    "Precision: 0.7818181818181819\n",
    "\n",
    "Accuracy: 0.6783216783216783\n",
    "\n",
    "How does the k-nearest neighbor classifier compare to the decision tree classifier?\n",
    "\n",
    "## Regressor\n",
    "\n",
    "To test k-nearest neighbors for regression, we will use the insurance data from Labs 7, 8, and 13.  Recall we are trying to predict the insurance cost, a quantitative value.  \n",
    "\n",
    "If you don't have the dataset, download it from GitHub: [https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv)\n",
    "\n",
    "In this data, each row represents an insurance policy and the 7 columns contain the following information about it:\n",
    "- age: age of policy holder\n",
    "- sex: sex of policy holder\n",
    "- bmi: boday mass index (bmi) of policy holder.  bmi is a (sometimes unreliable) measurement of body fat in adults\n",
    "- children: number of children (dependents) on the policy\n",
    "- smoker: whether the policy holder is a smoker\n",
    "- region: region of the country the policy holder lives in\n",
    "- charges: price for insurance policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the insurance data, replacing the qualitative columns with dummy variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an X variable with the independent variable columns (everything except the charges column)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a y variable with the `charges` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split your X and y data into training and testing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code creates a 3-nearest neighbor regressor (k = 3), fits the training data to it, and makes predictions for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ik3nn = KNeighborsRegressor(n_neighbors = 3)\n",
    "ik3nn.fit(iX_train, iy_train)\n",
    "iy_pred = ik3nn.predict(iX_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the mean squared error for your predictions."
   ]
  },
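  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, one common way to write the mean squared error for $n$ test points, with true values $y_i$ and predictions $\\hat{y}_i$ (notation introduced here for illustration), is:\n",
    "\n",
    "$$MSE = \\frac{1}{n} \\sum_{i=1}^{n} (\\hat{y}_i - y_i)^2$$\n",
    "\n",
    "For example, with made-up true values $(1, 5)$ and predictions $(3, 5)$, the errors are $2$ and $0$, so $MSE = \\frac{2^2 + 0^2}{2} = 2$."
   ]
  },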
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scaling data (aka normalization)\n",
    "\n",
    "When the columns have different scales, the largest column will dominate.  We can get better results by scaling all of our columns to be between 0 and 1.  The scaling formula is:\n",
    "\n",
    "$$x_{scaled} = \\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}}$$\n",
    "\n",
    "We can use a built in function in sci-kit learn to do the scaling:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "scaler = MinMaxScaler(feature_range=(0, 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iX_train_scaled = scaler.fit_transform(iX_train)\n",
    "iX_train_scaled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scale your X test data.  We do not need to scale the y data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Built a 3-nearest neighbor regressor with the scaled training data and use it to make predictions for the scaled test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the new mean squared error.  Does scaling improve the 3-nearest neighbor regressor?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To figure out which value of k to use, we can write a loop to try all values of k between 1 and 20, and compute the mean squared error for each one.  The pseudo-code to do this is:\n",
    "\n",
    "<code>\n",
    "create an empty list\n",
    "loop k from 1 to 20:\n",
    "    create a k-nearest neighbor regressor\n",
    "    fit the training data to the k-nearest neighbor regressor\n",
    "    make predictions for the test data\n",
    "    compute the mean squared error for the predictions\n",
    "    store the mean squared error in the list\n",
    "</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mses = []\n",
    "for k in range(1,21):\n",
    "    iknn_scaled = KNeighborsRegressor(n_neighbors = k)\n",
    "    iknn_scaled.fit(iX_train_scaled, iy_train)\n",
    "    iy_pred_scaled = iknn_scaled.predict(iX_test_scaled)\n",
    "    mse = ((iy_pred_scaled - iy_test)**2).mean()\n",
    "    mses.append(mse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>\n",
    "mses = []\n",
    "for k in range(1,21):\n",
    "    iknn_scaled = KNeighborsRegressor(n_neighbors = k)\n",
    "    iknn_scaled.fit(iX_train_scaled, iy_train)\n",
    "    iy_pred_scaled = iknn_scaled.predict(iX_test_scaled)\n",
    "    mse = ((iy_pred_scaled - iy_test)**2).mean()\n",
    "    mses.append(mse)\n",
    "</code>\n",
    "</details>\n",
    "\n",
    "Plot the list of mean squared errors.  The lowest one will correspond to the best k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just as with linear regression, we can see if there is a pattern to which values are predicted correctly and which are not.  Plot a scatter plot with the true y test values on the x axis, and the predicted value - the true value on the y axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}